1.
Syst Biol ; 2023 May 26.
Article in English | MEDLINE | ID: covidwho-20238153

ABSTRACT

Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 datasets do not fit this mold. There are currently over 14 million sequenced SARS-CoV-2 genomes in online databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an "online" approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between likelihood and parsimony approaches to phylogenetic inference. Maximum likelihood (ML) and pseudo-ML methods may be more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that these instances will be extremely rare because each internal branch is expected to be extremely short. Therefore, it may be that approaches based on maximum parsimony (MP) are sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger datasets. Here, we evaluate the performance of de novo and online phylogenetic approaches, as well as ML, pseudo-ML, and MP frameworks for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces similar phylogenetic trees to de novo analyses for SARS-CoV-2, and that MP optimization with UShER and matOptimize produces equivalent SARS-CoV-2 phylogenies to some of the most popular ML and pseudo-ML inference tools. 
MP optimization with UShER and matOptimize is thousands of times faster than presently available implementations of ML, and online phylogenetics is faster than de novo inference. Our results therefore suggest that parsimony-based methods like UShER and matOptimize represent an accurate and more practical alternative to established maximum likelihood implementations for large SARS-CoV-2 phylogenies and could be successfully applied to other similar datasets with particularly dense sampling and short branch lengths.
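
To make the parsimony criterion discussed in this abstract concrete, the following is a minimal sketch of Fitch's small-parsimony count for a single alignment site on a fixed binary tree. It is a textbook illustration only, not the UShER or matOptimize implementation; the tree encoding (nested tuples), the sample names, and the function name `fitch_score` are assumptions made for this example.

```python
def fitch_score(tree, leaf_states):
    """Return the minimum number of state changes needed at one site.

    tree: nested 2-tuples of leaf names, e.g. (("s1", "s2"), ("s3", "s4"))
    leaf_states: dict mapping each leaf name to its observed nucleotide
    (Both encodings are hypothetical, chosen for brevity.)
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its state set is the observed base
            return {leaf_states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                          # children agree: no change needed here
            return common
        changes += 1                        # children disagree: count one change
        return left | right

    visit(tree)
    return changes


tree = (("s1", "s2"), ("s3", "s4"))
print(fitch_score(tree, {"s1": "C", "s2": "C", "s3": "T", "s4": "T"}))  # 1
print(fitch_score(tree, {"s1": "C", "s2": "T", "s3": "C", "s4": "T"}))  # 2
```

With very short branches, most sites change at most once on the tree, which is the situation in which such a count is expected to agree closely with likelihood-based reconstruction.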

2.
Nat Genet ; 55(5): 746-752, 2023 05.
Article in English | MEDLINE | ID: covidwho-2322683

ABSTRACT

Phylogenetics has a crucial role in genomic epidemiology. Enabled by unparalleled volumes of genome sequence data generated to study and help contain the COVID-19 pandemic, phylogenetic analyses of SARS-CoV-2 genomes have shed light on the virus's origins, spread, and the emergence and reproductive success of new variants. However, most phylogenetic approaches, including maximum likelihood and Bayesian methods, cannot scale to the size of the datasets from the current pandemic. We present 'MAximum Parsimonious Likelihood Estimation' (MAPLE), an approach for likelihood-based phylogenetic analysis of epidemiological genomic datasets at unprecedented scales. MAPLE infers SARS-CoV-2 phylogenies more accurately than existing maximum likelihood approaches while running up to thousands of times faster, and requiring at least 100 times less memory on large datasets. This extends the reach of genomic epidemiology, allowing the continued use of accurate phylogenetic, phylogeographic and phylodynamic analyses on datasets of millions of genomes.


Subject(s)
COVID-19 , Humans , Phylogeny , COVID-19/epidemiology , COVID-19/genetics , SARS-CoV-2/genetics , Likelihood Functions , Pandemics , Bayes Theorem
3.
Microb Genom ; 9(5)2023 05.
Article in English | MEDLINE | ID: covidwho-2318756

ABSTRACT

Exposure to different mutagens leaves distinct mutational patterns that can allow inference of pathogen replication niches. We therefore investigated whether SARS-CoV-2 mutational spectra might show lineage-specific differences, dependent on the dominant site(s) of replication and onwards transmission, and could therefore rapidly infer virulence of emergent variants of concern (VOCs). Through mutational spectrum analysis, we found a significant reduction in G>T mutations in the Omicron variant, which replicates in the upper respiratory tract (URT), compared to other lineages, which replicate in both the URT and lower respiratory tract (LRT). Mutational analysis of other viruses and bacteria indicates a robust, generalizable association of high G>T mutations with replication within the LRT. Monitoring G>T mutation rates over time, we found early separation of Omicron from Beta, Gamma and Delta, while mutational patterns in Alpha varied consistent with changes in transmission source as social restrictions were lifted. Mutational spectra may be a powerful tool to infer niches of established and emergent pathogens.


Subject(s)
COVID-19 , Humans , SARS-CoV-2/genetics , Mutation , Bacteria/genetics , Lung
4.
Bioinformatics ; 2022 Dec 01.
Article in English | MEDLINE | ID: covidwho-2242322

ABSTRACT

SUMMARY: Treenome Browser is a web browser tool to interactively visualize millions of genomes alongside huge phylogenetic trees. AVAILABILITY AND IMPLEMENTATION: Treenome Browser for SARS-CoV-2 can be accessed at cov2tree.org, or at taxonium.org for user-provided trees. Source code and documentation are available at github.com/theosanderson/taxonium and docs.taxonium.org/en/latest/treenome.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

5.
PLoS Comput Biol ; 18(8): e1010409, 2022 08.
Article in English | MEDLINE | ID: covidwho-2002267

ABSTRACT

Accurate simulation of complex biological processes is an essential component of developing and validating new technologies and inference approaches. As an effort to help contain the COVID-19 pandemic, large numbers of SARS-CoV-2 genomes have been sequenced from most regions in the world. More than 5.5 million viral sequences are publicly available as of November 2021. Many studies estimate viral genealogies from these sequences, as these can provide valuable information about the spread of the pandemic across time and space. Additionally, such data are a rich source of information about molecular evolutionary processes including natural selection, for example allowing the identification of new variants with transmissibility and immunity evasion advantages. To our knowledge, there is no framework that is both efficient and flexible enough to simulate the pandemic to approximate world-scale scenarios and generate viral genealogies of millions of samples. Here, we introduce a new fast simulator, VGsim, which addresses the problem of simulating genealogies under epidemiological models. The simulation process is split into two phases. During the forward run the algorithm generates a chain of population-level events reflecting the dynamics of the pandemic using a hierarchical version of the Gillespie algorithm. During the backward run a coalescent-like approach generates a tree genealogy of samples conditioning on the population-level events chain generated during the forward run. Our software can model complex population structure, epistasis and immunity escape.


Subject(s)
COVID-19 , Pandemics , COVID-19/epidemiology , Computer Simulation , Humans , SARS-CoV-2/genetics , Software
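
The forward run described in the abstract above produces a chain of population-level events. As a rough illustration of that idea, here is a sketch of the plain (non-hierarchical) Gillespie algorithm for a single-population SIR model. It is not VGsim's code or API; the function name, the parameters `beta` and `gamma`, and the event labels are all assumptions for this example, and `gamma` is assumed to be positive.

```python
import random


def gillespie_sir(s, i, r, beta, gamma, t_max, seed=0):
    """Simulate SIR events with the standard Gillespie algorithm.

    Returns a time-ordered list of (time, event) tuples and the final (s, i, r).
    beta: transmission rate; gamma: recovery rate (assumed > 0).
    """
    rng = random.Random(seed)
    n = s + i + r
    t = 0.0
    events = []
    while i > 0:
        rate_infection = beta * s * i / n
        rate_recovery = gamma * i
        total = rate_infection + rate_recovery
        t += rng.expovariate(total)         # waiting time to the next event
        if t >= t_max:
            break
        # Choose the event type in proportion to its rate.
        if rng.random() * total < rate_infection:
            s, i = s - 1, i + 1
            events.append((t, "infection"))
        else:
            i, r = i - 1, r + 1
            events.append((t, "recovery"))
    return events, (s, i, r)


events, final_state = gillespie_sir(s=990, i=10, r=0, beta=0.3, gamma=0.1, t_max=50.0)
print(len(events), final_state)
```

A backward pass of the kind the abstract mentions would then walk this event chain in reverse to build a genealogy of sampled cases; that step is omitted here.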
6.
Virus Evol ; 8(1): veac048, 2022.
Article in English | MEDLINE | ID: covidwho-1997077

ABSTRACT

The unprecedented severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) global sequencing effort has suffered from an analytical bottleneck. Many existing methods for phylogenetic analysis are designed for sparse, static datasets and are too computationally expensive to apply to densely sampled, rapidly expanding datasets when results are needed immediately to inform public health action. For example, public health is often concerned with identifying clusters of closely related samples, but the sheer scale of the data prevents manual inspection and the current computational models are often too expensive in time and resources. Even when results are available, intuitive data exploration tools are of critical importance to effective public health interpretation and action. To help address this need, we present a phylogenetic heuristic that quickly and efficiently identifies newly introduced strains in a region, resulting in clusters of infected individuals, and their putative geographic origins. We show that this approach performs well on simulated data and yields results largely congruent with more sophisticated Bayesian phylogeographic modeling approaches. We also introduce Cluster-Tracker (https://clustertracker.gi.ucsc.edu/), a novel interactive web-based tool to facilitate effective and intuitive SARS-CoV-2 geographic data exploration and visualization across the USA. Cluster-Tracker is updated daily and automatically identifies and highlights groups of closely related SARS-CoV-2 infections resulting from the transmission of the virus between two geographic areas by travelers, streamlining public health tracking of local viral diversity and emerging infection clusters. The site is open-source and designed to be easily configured to analyze any chosen region, making it a useful resource globally. 
The combination of these open-source tools will empower detailed investigations of the geographic origins and spread of SARS-CoV-2 and other densely sampled pathogens.

7.
Nature ; 609(7929): 994-997, 2022 09.
Article in English | MEDLINE | ID: covidwho-1991628

ABSTRACT

Accurate and timely detection of recombinant lineages is crucial for interpreting genetic variation, reconstructing epidemic spread, identifying selection and variants of interest, and accurately performing phylogenetic analyses (1-4). During the SARS-CoV-2 pandemic, genomic data generation has exceeded the capacities of existing analysis platforms, thereby crippling real-time analysis of viral evolution (5). Here, we use a new phylogenomic method to search a nearly comprehensive SARS-CoV-2 phylogeny for recombinant lineages. In a 1.6 million sample tree from May 2021, we identify 589 recombination events, which indicate that around 2.7% of sequenced SARS-CoV-2 genomes have detectable recombinant ancestry. Recombination breakpoints are inferred to occur disproportionately in the 3' portion of the genome that contains the spike protein. Our results highlight the need for timely analyses of recombination for pinpointing the emergence of recombinant lineages with the potential to increase transmissibility or virulence of the virus. We anticipate that this approach will empower comprehensive real-time tracking of viral recombination during the SARS-CoV-2 pandemic and beyond.


Subject(s)
COVID-19 , Genome, Viral , Pandemics , Phylogeny , Recombination, Genetic , SARS-CoV-2 , COVID-19/epidemiology , COVID-19/transmission , COVID-19/virology , Genome, Viral/genetics , Humans , Mutation , Recombination, Genetic/genetics , SARS-CoV-2/genetics , SARS-CoV-2/pathogenicity , Selection, Genetic/genetics , Spike Glycoprotein, Coronavirus/genetics , Virulence/genetics
8.
Bioinformatics ; 38(15): 3734-3740, 2022 Aug 02.
Article in English | MEDLINE | ID: covidwho-1901115

ABSTRACT

MOTIVATION: Phylogenetic tree optimization is necessary for precise analysis of evolutionary and transmission dynamics, but existing tools are inadequate for handling the scale and pace of data produced during the coronavirus disease 2019 (COVID-19) pandemic. One transformative approach, online phylogenetics, aims to incrementally add samples to an ever-growing phylogeny, but there are no previously existing approaches that can efficiently optimize this vast phylogeny under the time constraints of the pandemic. RESULTS: Here, we present matOptimize, a fast and memory-efficient phylogenetic tree optimization tool based on parsimony that can be parallelized across multiple CPU threads and nodes, and provides orders of magnitude improvement in runtime and peak memory usage compared to existing state-of-the-art methods. We have developed this method particularly to address the pressing need during the COVID-19 pandemic for daily maintenance and optimization of a comprehensive SARS-CoV-2 phylogeny. matOptimize is currently helping refine on a daily basis possibly the largest-ever phylogenetic tree, containing millions of SARS-CoV-2 sequences. AVAILABILITY AND IMPLEMENTATION: The matOptimize code is freely available as part of the UShER package (https://github.com/yatisht/usher) and can also be installed via bioconda (https://bioconda.github.io/recipes/usher/README.html). All scripts we used to perform the experiments in this manuscript are available at https://github.com/yceh/matOptimize-experiments. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
COVID-19 , SARS-CoV-2 , Humans , Phylogeny , SARS-CoV-2/genetics , Pandemics , Software
9.
PLoS Comput Biol ; 18(4): e1010056, 2022 04.
Article in English | MEDLINE | ID: covidwho-1833504

ABSTRACT

Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, and are an essential component of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here, we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. >100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and it implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.


Subject(s)
COVID-19 , Pandemics , Algorithms , COVID-19/epidemiology , Computer Simulation , Evolution, Molecular , Humans , Phylogeny , SARS-CoV-2/genetics , Software
11.
MMWR Morb Mortal Wkly Rep ; 70(35): 1214-1219, 2021 Sep 03.
Article in English | MEDLINE | ID: covidwho-1412877

ABSTRACT

On May 25, 2021, the Marin County Department of Public Health (MCPH) was notified by an elementary school that on May 23, an unvaccinated teacher had reported receiving a positive test result for SARS-CoV-2, the virus that causes COVID-19. The teacher reported becoming symptomatic on May 19, but continued to work for 2 days before receiving a test on May 21. On occasion during this time, the teacher read aloud unmasked to the class despite school requirements to mask while indoors. Beginning May 23, additional cases of COVID-19 were reported among other staff members, students, parents, and siblings connected to the school. To characterize the outbreak, on May 26, MCPH initiated case investigation and contact tracing that included whole genome sequencing (WGS) of available specimens. A total of 27 cases were identified, including that of the teacher. During May 23-26, among the teacher's 24 students, 22 students, all ineligible for vaccination because of age, received testing for SARS-CoV-2; 12 received positive test results. The attack rate in the two rows seated closest to the teacher's desk was 80% (eight of 10) and was 28% (four of 14) in the three back rows (Fisher's exact test; p = 0.036). During May 24-June 1, six of 18 students in a separate grade at the school, all also too young for vaccination, received positive SARS-CoV-2 test results. Eight additional cases were also identified, all in parents and siblings of students in these two grades. Among these additional cases, three were in persons fully vaccinated in accordance with CDC recommendations (1). Among the 27 total cases, 22 (81%) persons reported symptoms; the most frequently reported symptoms were fever (41%), cough (33%), headache (26%), and sore throat (26%). WGS of all 18 available specimens identified the B.1.617.2 (Delta) variant. 
Vaccines are effective against the Delta variant (2), but risk of transmission remains elevated among unvaccinated persons in schools without strict adherence to prevention strategies. In addition to vaccination for eligible persons, strict adherence to nonpharmaceutical prevention strategies, including masking, routine testing, facility ventilation, and staying home when symptomatic, are important to ensure safe in-person learning in schools (3).


Subject(s)
COVID-19/epidemiology , COVID-19/virology , Disease Outbreaks , SARS-CoV-2/isolation & purification , Schools , Adult , COVID-19/prevention & control , COVID-19/transmission , COVID-19 Vaccines/administration & dosage , California/epidemiology , Child , Contact Tracing , Humans , Masks/statistics & numerical data , School Teachers/statistics & numerical data
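
The seating comparison in the abstract above (eight of 10 positive in the two front rows versus four of 14 in the three back rows) can be checked with a two-sided Fisher's exact test. The sketch below uses only the Python standard library; the function name and the rule of summing all tables no more probable than the observed one are the usual textbook formulation, assumed here because the abstract does not state which variant was used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table (with the same
    margins) that is no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def weight(k):
        # Unnormalized probability that the top-left cell equals k.
        return comb(row1, k) * comb(row2, col1 - k)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    extreme = sum(w for w in map(weight, range(lowest, highest + 1)) if w <= observed)
    return extreme / comb(total, col1)


# Front rows: 8 positive, 2 negative; back rows: 4 positive, 10 negative.
p_value = fisher_exact_two_sided(8, 2, 4, 10)
print(f"{p_value:.3f}")  # 0.036
```

Integer weights are compared before normalizing, which avoids floating-point ties when deciding which tables count as at least as extreme.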
12.
BMC Res Notes ; 14(1): 189, 2021 May 17.
Article in English | MEDLINE | ID: covidwho-1388823

ABSTRACT

OBJECTIVE: The SARS-CoV-2 pandemic has prompted one of the most extensive and expeditious genomic sequencing efforts in history. Each viral genome is accompanied by a set of metadata which supplies important information such as the geographic origin of the sample, age of the host, and the lab at which the sample was sequenced, and is integral to epidemiological efforts and public health direction. Here, we interrogate some shortcomings of metadata within the GISAID database to raise awareness of common errors and inconsistencies that may affect data-driven analyses and provide possible avenues for resolutions. RESULTS: Our analysis reveals a startling prevalence of spelling errors and inconsistent naming conventions, which together occur in an estimated ~ 9.8% and ~ 11.6% of "originating lab" and "submitting lab" GISAID metadata entries respectively. We also find numerous ambiguous entries which provide very little information about the actual source of a sample and could easily associate with multiple sources worldwide. Importantly, all of these issues can impair the ability and accuracy of association studies by deceptively causing a group of samples to identify with multiple sources when they truly all identify with one source, or vice versa.


Subject(s)
COVID-19 , SARS-CoV-2 , Genome, Viral/genetics , Genomics , Humans , Metadata , Phylogeny
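
Spelling variants of the kind described in the abstract above can be surfaced with simple string similarity. The following is a minimal sketch using Python's standard `difflib`; it is not the authors' method, and the lab names, the `normalize` helper, and the 0.9 threshold are hypothetical choices for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop commas and periods, and collapse whitespace."""
    cleaned = name.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def near_duplicates(names, threshold=0.9):
    """Return pairs of distinct names whose normalized forms are very similar."""
    keyed = [(name, normalize(name)) for name in names]
    pairs = []
    for index, (first, first_key) in enumerate(keyed):
        for second, second_key in keyed[index + 1:]:
            similarity = SequenceMatcher(None, first_key, second_key).ratio()
            if first != second and similarity >= threshold:
                pairs.append((first, second))
    return pairs


# Hypothetical metadata entries, not taken from GISAID.
labs = ["Public Health Laboratory", "Public Health Labratory", "Institute of Virology"]
print(near_duplicates(labs))
```

The comparison is quadratic in the number of distinct names, so for a full database one would likely deduplicate exact matches first; flagged pairs still need manual review, since similar names can belong to genuinely different labs.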
13.
PLoS Genet ; 16(11): e1009175, 2020 11.
Article in English | MEDLINE | ID: covidwho-1388878

ABSTRACT

The SARS-CoV-2 pandemic has led to unprecedented, nearly real-time genetic tracing due to the rapid community sequencing response. Researchers immediately leveraged these data to infer the evolutionary relationships among viral samples and to study key biological questions, including whether host viral genome editing and recombination are features of SARS-CoV-2 evolution. This global sequencing effort is inherently decentralized and must rely on data collected by many labs using a wide variety of molecular and bioinformatic techniques. There is thus a strong possibility that systematic errors associated with lab- or protocol-specific practices affect some sequences in the repositories. We find that some recurrent mutations in reported SARS-CoV-2 genome sequences have been observed predominantly or exclusively by single labs, co-localize with commonly used primer binding sites and are more likely to affect the protein-coding sequences than other similarly recurrent mutations. We show that their inclusion can affect phylogenetic inference on scales relevant to local lineage tracing, and make it appear as though there has been an excess of recurrent mutation or recombination among viral lineages. We suggest how samples can be screened and problematic variants removed, and we plan to regularly inform the scientific community with our updated results as more SARS-CoV-2 genome sequences are shared (https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473 and https://virological.org/t/masking-strategies-for-sars-cov-2-alignments/480). We also develop tools for comparing and visualizing differences among very large phylogenies and we show that consistent clade- and tree-based comparisons can be made between phylogenies produced by different groups. These will facilitate evolutionary inferences and comparisons among phylogenies produced for a wide array of purposes. 
Building on the SARS-CoV-2 Genome Browser at UCSC, we present a toolkit to compare, analyze and combine SARS-CoV-2 phylogenies, find and remove potential sequencing errors and establish a widely shared, stable clade structure for a more accurate scientific inference and discourse.


Subject(s)
Genome, Viral/genetics , Phylogeny , SARS-CoV-2/genetics , Algorithms , COVID-19 , Computational Biology , Evolution, Molecular , Humans , RNA, Viral/genetics , Sequence Alignment , Whole Genome Sequencing
14.
Mol Biol Evol ; 38(12): 5819-5824, 2021 12 09.
Article in English | MEDLINE | ID: covidwho-1381034

ABSTRACT

The vast scale of SARS-CoV-2 sequencing data has made it increasingly challenging to comprehensively analyze all available data using existing tools and file formats. To address this, we present a database of SARS-CoV-2 phylogenetic trees inferred with unrestricted public sequences, which we update daily to incorporate new sequences. Our database uses the recently proposed mutation-annotated tree (MAT) format to efficiently encode the tree with branches labeled with parsimony-inferred mutations, as well as Nextstrain clade and Pango lineage labels at clade roots. As of June 9, 2021, our SARS-CoV-2 MAT consists of 834,521 sequences and provides a comprehensive view of the virus' evolutionary history using public data. We also present matUtils, a command-line utility for rapidly querying, interpreting, and manipulating the MATs. Our daily-updated SARS-CoV-2 MAT database and matUtils software are available at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ and https://github.com/yatisht/usher, respectively.


Subject(s)
Evolution, Molecular , Phylogeny , SARS-CoV-2 , COVID-19/virology , Humans , Mutation , SARS-CoV-2/genetics , Software
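
The abstract above describes a tree whose branches are labeled with inferred mutations. A minimal in-memory sketch of that idea follows: each node stores only the mutations on the branch leading to it, and a sample's differences from the root are recovered by walking the path from the root to the sample. This is not the actual MAT file format or the matUtils API; the `Node` class, the function name, and the positions and bases used are hypothetical.

```python
class Node:
    """A tree node holding the mutations on the branch from its parent."""

    def __init__(self, name, mutations=(), parent=None):
        self.name = name
        # Each mutation is (position, parent_base, new_base).
        self.mutations = list(mutations)
        self.parent = parent


def mutations_from_root(node):
    """Return {position: base} for all sites where the node differs along its path."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    state = {}
    for ancestor in reversed(path):         # apply mutations from the root downward
        for position, _parent_base, new_base in ancestor.mutations:
            state[position] = new_base      # later mutations overwrite earlier ones
    return state


# Illustrative positions and bases only.
root = Node("root")
clade = Node("clade_A", [(100, "C", "T")], parent=root)
sample = Node("sample_1", [(200, "G", "A")], parent=clade)
print(mutations_from_root(sample))  # {100: 'T', 200: 'A'}
```

Because samples share the mutations of their ancestors, storing them once per branch is far more compact than storing a full sequence per sample, which is the motivation the abstract gives for the format.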
16.
Nat Genet ; 53(6): 809-816, 2021 06.
Article in English | MEDLINE | ID: covidwho-1223103

ABSTRACT

As the SARS-CoV-2 virus spreads through human populations, the unprecedented accumulation of viral genome sequences is ushering in a new era of 'genomic contact tracing', that is, using viral genomes to trace local transmission dynamics. However, because the viral phylogeny is already so large (and will undoubtedly grow many fold), placing new sequences onto the tree has emerged as a barrier to real-time genomic contact tracing. Here, we resolve this challenge by building an efficient tree-based data structure encoding the inferred evolutionary history of the virus. We demonstrate that our approach greatly improves the speed of phylogenetic placement of new samples and data visualization, making it possible to complete the placements under the constraints of real-time contact tracing. Thus, our method addresses an important need for maintaining a fully updated reference phylogeny. We make these tools available to the research community through the University of California Santa Cruz SARS-CoV-2 Genome Browser to enable rapid cross-referencing of information in new virus sequences with an ever-expanding array of molecular and structural biology data. The methods described here will empower research and genomic contact tracing for SARS-CoV-2 specifically for laboratories worldwide.


Subject(s)
COVID-19/epidemiology , COVID-19/virology , Computational Biology/methods , Phylogeny , SARS-CoV-2/classification , SARS-CoV-2/genetics , Software , Algorithms , Computational Biology/standards , Databases, Genetic , Genome, Viral , Humans , Molecular Sequence Annotation , Mutation , Web Browser
17.
Genome Biol Evol ; 13(5)2021 05 07.
Article in English | MEDLINE | ID: covidwho-1199488

ABSTRACT

The COVID-19 pandemic has seen an unprecedented response from the sequencing community. Leveraging the sequence data from more than 140,000 SARS-CoV-2 genomes, we study mutation rates and selective pressures affecting the virus. Understanding the processes and effects of mutation and selection has profound implications for the study of viral evolution, for vaccine design, and for the tracking of viral spread. We highlight and address some common genome sequence analysis pitfalls that can lead to inaccurate inference of mutation rates and selection, such as ignoring skews in the genetic code, not accounting for recurrent mutations, and assuming evolutionary equilibrium. We find that two particular mutation rates, G→U and C→U, are similarly elevated and considerably higher than all other mutation rates, causing the majority of mutations in the SARS-CoV-2 genome, and are possibly the result of APOBEC and ROS activity. These mutations also tend to occur many times at the same genome positions along the global SARS-CoV-2 phylogeny (i.e., they are very homoplasic). We observe an effect of genomic context on mutation rates, but the effect of the context is overall limited. Although previous studies have suggested selection acting to decrease U content at synonymous sites, we bring forward evidence suggesting the opposite.


Subject(s)
Mutation Rate , SARS-CoV-2/genetics , Selection, Genetic , Silent Mutation/genetics , COVID-19/virology , Evolution, Molecular , Genome, Viral , Phylogeny , RNA, Viral/genetics , SARS-CoV-2/classification , Sequence Analysis, RNA
18.
bioRxiv ; 2020 Sep 28.
Article in English | MEDLINE | ID: covidwho-835238

ABSTRACT

As the SARS-CoV-2 virus spreads through human populations, the unprecedented accumulation of viral genome sequences is ushering in a new era of "genomic contact tracing", that is, using viral genome sequences to trace local transmission dynamics. However, because the viral phylogeny is already so large (and will undoubtedly grow many fold), placing new sequences onto the tree has emerged as a barrier to real-time genomic contact tracing. Here, we resolve this challenge by building an efficient, tree-based data structure encoding the inferred evolutionary history of the virus. We demonstrate that our approach improves the speed of phylogenetic placement of new samples and data visualization by orders of magnitude, making it possible to complete the placements under real-time constraints. Our method also provides the key ingredient for maintaining a fully-updated reference phylogeny. We make these tools available to the research community through the UCSC SARS-CoV-2 Genome Browser to enable rapid cross-referencing of information in new virus sequences with an ever-expanding array of molecular and structural biology data. The methods described here will empower research and genomic contact tracing for laboratories worldwide. SOFTWARE AVAILABILITY: UShER is available to users through the UCSC Genome Browser at https://genome.ucsc.edu/cgi-bin/hgPhyloPlace. The source code and detailed instructions on how to compile and run UShER are available from https://github.com/yatisht/usher.
